In this notebook, we try to reproduce *What Happens To BERT Embeddings During Fine-tuning?*, which was accepted at the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (EMNLP 2020).
In this section, we load the wiki40b dataset along with three models: bert-base-uncased, the pretrained model released by Google; bert-mnli, fine-tuned on the glue/mnli dataset; and bert-squad, fine-tuned on SQuAD.
import os, sys
import re
from typing import Dict

sys.path.append(os.path.abspath('..'))

import tensorflow as tf
import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import tensorflow_datasets as tfds
from transformers import TFBertModel, BertTokenizer, BertConfig
from bert_repro.plot import BertComparator

plotly.offline.init_notebook_mode()
# All three models expose hidden states and attentions so their layers can be compared
config_kwargs = dict(output_hidden_states=True, output_attentions=True)
bert_base_config = BertConfig(**config_kwargs)
bert_mnli_config = BertConfig(**config_kwargs)
bert_squad_config = BertConfig(**config_kwargs)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_base = TFBertModel.from_pretrained('bert-base-uncased', config=bert_base_config)
bert_mnli = TFBertModel.from_pretrained('../models/mnli', config=bert_mnli_config)
bert_squad = TFBertModel.from_pretrained('../models/squad', config=bert_squad_config)
Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
Some layers from the model checkpoint at ../models/mnli were not used when initializing TFBertModel: ['dropout_37', 'classifier']
All the layers of TFBertModel were initialized from the model checkpoint at ../models/mnli. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
Some layers from the model checkpoint at ../models/squad were not used when initializing TFBertModel: ['qa_outputs']
Some layers of TFBertModel were not initialized from the model checkpoint at ../models/squad and are newly initialized: ['bert/pooler/dense/bias:0', 'bert/pooler/dense/kernel:0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
wiki = tfds.load('wiki40b/en', data_dir='../data')
INFO:absl:Load dataset info from ../data/wiki40b/en/1.3.0
INFO:absl:Reusing dataset wiki40b (../data/wiki40b/en/1.3.0)
INFO:absl:Constructing tf.data.Dataset for split None, from ../data/wiki40b/en/1.3.0
We use the pretrained model and the fine-tuned models to extract each layer's hidden states on the same inputs, and compare them with cosine similarity.
bc_mnli = BertComparator(bert_base, bert_mnli, tokenizer)
bc_squad = BertComparator(bert_base, bert_squad, tokenizer)
mnli_result = bc_mnli.get_layer_cos_sim_in_data(wiki['train'])
squad_result = bc_squad.get_layer_cos_sim_in_data(wiki['train'])
results = [(l_idx, sim, 'base_squad') for l_idx, sim in enumerate(squad_result)] \
        + [(l_idx, sim, 'base_mnli') for l_idx, sim in enumerate(mnli_result)]
df = pd.DataFrame(results, columns=['layer_index', 'similarity', 'versus'])
px.line(df, x='layer_index', y='similarity', color='versus')
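As a rough illustration of what `get_layer_cos_sim_in_data` presumably computes, the sketch below averages token-level cosine similarity between two models' hidden states, one value per layer. The function name and the exact averaging scheme are assumptions here, not the actual `BertComparator` implementation.

```python
import numpy as np

def layer_cos_sim(base_hidden, tuned_hidden):
    """Per-layer average cosine similarity between two models' hidden states.

    base_hidden, tuned_hidden: lists of arrays, one per layer,
    each of shape (num_tokens, hidden_size) for the same input tokens.
    (Hypothetical sketch; the real BertComparator may aggregate differently.)
    """
    sims = []
    for h_base, h_tuned in zip(base_hidden, tuned_hidden):
        # cosine similarity per token, then averaged over tokens
        num = np.sum(h_base * h_tuned, axis=-1)
        denom = np.linalg.norm(h_base, axis=-1) * np.linalg.norm(h_tuned, axis=-1)
        sims.append(float(np.mean(num / denom)))
    return sims
```

Comparing each fine-tuned model against the base model this way yields one similarity curve per model over the layer index, which is what the plot above shows.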
From the figure above, we can conclude that fine-tuning mainly changes the top layers: the cosine similarity between the pretrained and fine-tuned hidden states stays high in the lower layers and drops toward the higher ones.
A structural probe is a method for evaluating whether a word-representation model encodes the syntactic structure of sentences. The main idea is that, if it does, the syntax tree structure is preserved under a linear projection. The following figure illustrates this concept.
So how do we evaluate it? We train a probe model to predict, for every pair of tokens, the number of edges between them in the syntax tree.
The following figure shows that a probe model can project word representations into a subspace that preserves the syntax tree structure of the sentence.
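The probe's prediction can be sketched as follows: given two word vectors and a learned projection matrix `B`, the squared L2 distance between the projected vectors serves as the predicted tree distance (number of edges). This is a minimal sketch of the distance computation only; training `B` to match gold tree distances is omitted, and the function name is our own.

```python
import numpy as np

def probe_distance(h_i, h_j, B):
    """Squared L2 distance between two word vectors after the linear
    projection B. The structural probe trains B so that this value
    approximates the number of edges between the two words in the
    syntax tree. (Illustrative sketch, not the repo's implementation.)
    """
    diff = B @ (h_i - h_j)
    return float(diff @ diff)
```

With `B` set to the identity, this reduces to the ordinary squared Euclidean distance; training reshapes the space so that distances align with the parse tree.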